
    Recurrent Human Pose Estimation

    We propose a novel ConvNet model for predicting 2D human body poses in an image. The model regresses a heatmap representation for each body keypoint and is able to learn and represent both the part appearances and the context of the part configuration. We make the following three contributions: (i) an architecture combining a feed-forward module with a recurrent module, where the recurrent module can be run iteratively to improve performance; (ii) the model can be trained end-to-end and from scratch, with auxiliary losses incorporated to improve performance; (iii) we investigate whether keypoint visibility can also be predicted. The model is evaluated on two benchmark datasets. The result is a simple architecture that achieves performance on par with the state of the art, but without the complexity of a graphical model stage (or layers).
    Comment: FG 2017. More info and demo: http://www.robots.ox.ac.uk/~vgg/software/keypoint_detection
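
    A rough illustration of contribution (i): the sketch below pairs a feed-forward feature extractor with a recurrent refinement module that is unrolled for a fixed number of iterations, returning the per-iteration heatmaps so auxiliary losses can be attached. Layer sizes and names are illustrative assumptions, not the paper's exact architecture.

        import torch
        import torch.nn as nn

        class RecurrentPoseNet(nn.Module):
            def __init__(self, num_keypoints=16, feat_ch=64):
                super().__init__()
                # Feed-forward module: image -> shared feature map
                self.feedforward = nn.Sequential(
                    nn.Conv2d(3, feat_ch, 7, padding=3), nn.ReLU(),
                    nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
                )
                # Recurrent module: refines heatmaps given features + previous heatmaps
                self.recurrent = nn.Sequential(
                    nn.Conv2d(feat_ch + num_keypoints, feat_ch, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(feat_ch, num_keypoints, 1),
                )
                self.initial = nn.Conv2d(feat_ch, num_keypoints, 1)

            def forward(self, img, iterations=3):
                feats = self.feedforward(img)
                heatmaps = self.initial(feats)
                outputs = [heatmaps]          # every stage kept for auxiliary losses
                for _ in range(iterations):   # running more iterations refines the maps
                    heatmaps = self.recurrent(torch.cat([feats, heatmaps], dim=1))
                    outputs.append(heatmaps)
                return outputs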

    Gradient-based Uncertainty for Monocular Depth Estimation

    In monocular depth estimation, disturbances in the image context, like moving objects or reflecting materials, can easily lead to erroneous predictions. For that reason, uncertainty estimates for each pixel are necessary, in particular for safety-critical applications such as automated driving. We propose a post hoc uncertainty estimation approach for an already trained and thus fixed depth estimation model, represented by a deep neural network. The uncertainty is estimated from gradients extracted with an auxiliary loss function. To avoid relying on ground-truth information for the loss definition, we present an auxiliary loss function based on the correspondence between the depth predictions for an image and its horizontally flipped counterpart. Our approach achieves state-of-the-art uncertainty estimation results on the KITTI and NYU Depth V2 benchmarks without the need to retrain the neural network. Models and code are publicly available at https://github.com/jhornauer/GrUMoDepth.
    Comment: Accepted to ECCV 2022.
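
    A minimal sketch of the core idea, assuming a frozen depth network: the auxiliary loss measures the disagreement between the depth map of an image and the back-flipped depth map of its horizontally flipped copy, and the per-pixel gradient magnitude of that loss serves as the uncertainty. The paper extracts gradients from internal feature maps; input gradients are used here purely for brevity.

        import torch

        def gradient_uncertainty(model, image):
            """image: (1, 3, H, W); model: trained, fixed depth estimator."""
            image = image.detach().clone().requires_grad_(True)
            depth = model(image)                             # (1, 1, H, W)
            depth_flip = model(torch.flip(image, dims=[3]))  # depth of flipped input
            # Auxiliary loss without ground truth: prediction vs. the
            # back-flipped prediction of the horizontally flipped image.
            aux_loss = (depth - torch.flip(depth_flip, dims=[3])).abs().mean()
            aux_loss.backward()
            # Per-pixel uncertainty: gradient magnitude, summed over channels.
            return image.grad.abs().sum(dim=1, keepdim=True) # (1, 1, H, W)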

    Automated Automotive Radar Calibration With Intelligent Vehicles

    While automotive radar sensors are widely adopted and have been used for automatic cruise control and collision avoidance tasks, their application outside of vehicles is still limited. As they have the ability to resolve multiple targets in 3D space, radars can also be used for improving environment perception. This application, however, requires a precise calibration, which is usually a time-consuming and labor-intensive task. We therefore present an approach for automated and geo-referenced extrinsic calibration of automotive radar sensors that is based on a novel hypothesis filtering scheme. Our method does not require external modifications of a vehicle and instead uses the location data obtained from automated vehicles. This location data is then combined with filtered sensor data to create calibration hypotheses. Subsequent filtering and optimization recovers the correct calibration. Our evaluation on data from a real testing site shows that our method can correctly calibrate infrastructure sensors in an automated manner, thus enabling cooperative driving scenarios.
    Comment: 5 pages, 4 figures. Accepted for presentation at the 31st European Signal Processing Conference (EUSIPCO), September 4-8, 2023, Helsinki, Finland.
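
    To make the final optimization step concrete, here is a hedged sketch of recovering a 2D radar extrinsic (rotation and translation) from already-matched pairs of radar detections and geo-referenced vehicle positions via a Kabsch-style least-squares fit; the hypothesis filtering that produces those matches is the paper's contribution and is omitted here.

        import numpy as np

        def fit_extrinsic_2d(radar_pts, world_pts):
            """radar_pts, world_pts: (N, 2) matched points in sensor / world frame."""
            mu_r, mu_w = radar_pts.mean(axis=0), world_pts.mean(axis=0)
            H = (radar_pts - mu_r).T @ (world_pts - mu_w)   # cross-covariance
            U, _, Vt = np.linalg.svd(H)
            D = np.diag([1.0, np.linalg.det(Vt.T @ U.T)])   # guard against reflection
            R = Vt.T @ D @ U.T                              # rotation sensor -> world
            t = mu_w - R @ mu_r                             # translation
            return R, t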

    Pedestrian Environment Model for Automated Driving

    Besides interacting correctly with other vehicles, automated vehicles should also be able to react in a safe manner to vulnerable road users like pedestrians or cyclists. For a safe interaction between pedestrians and automated vehicles, the vehicle must be able to interpret the pedestrian's behavior. Common environment models do not contain information like body poses that is needed to understand the pedestrian's intent. In this work, we propose an environment model that includes the position of the pedestrians as well as their pose information. We use only images from a monocular camera and the vehicle's localization data as input to our pedestrian environment model. We extract skeletal information from the image with a neural human pose estimator. Furthermore, we track the skeletons with a simple tracking algorithm based on the Hungarian algorithm and ego-motion compensation. To obtain the 3D position, we aggregate the data from consecutive frames in conjunction with the vehicle position. We demonstrate our pedestrian environment model on data generated with the CARLA simulator and on the nuScenes dataset. Overall, we reach a relative position error of around 16% on both datasets.
    Comment: Accepted for presentation at the 26th IEEE International Conference on Intelligent Transportation Systems (ITSC 2023), 24-28 September 2023, Bilbao, Bizkaia, Spain.
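
    The tracking step can be sketched as follows, assuming skeletons are reduced to 2D centroids and existing tracks have already been ego-motion-compensated into the current frame; the gating threshold is an illustrative assumption.

        import numpy as np
        from scipy.optimize import linear_sum_assignment

        def associate(track_centroids, det_centroids, max_dist=50.0):
            """track_centroids: (N, 2), det_centroids: (M, 2) pixel coordinates."""
            cost = np.linalg.norm(
                track_centroids[:, None, :] - det_centroids[None, :, :], axis=2
            )
            rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
            # Gate out implausible matches; unmatched detections start new tracks.
            return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_dist]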

    Point Transformer

    In this work, we present Point Transformer, a deep neural network that operates directly on unordered and unstructured point sets. We design Point Transformer to extract local and global features and to relate both representations by introducing the local-global attention mechanism, which aims to capture spatial point relations and shape information. For that purpose, we propose SortNet, a component of Point Transformer that induces input permutation invariance by selecting points based on a learned score. The output of Point Transformer is a sorted and permutation-invariant feature list that can be directly incorporated into common computer vision applications. We evaluate our approach on standard classification and part segmentation benchmarks, demonstrating competitive results compared to prior work. Code is publicly available at: https://github.com/engelnico/point-transformer
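
    A minimal sketch of the SortNet selection idea: score each point with a small learned network, keep the top-k, and emit them ordered by score, so the output no longer depends on the input point ordering. Dimensions and layer sizes are assumptions, not the published architecture.

        import torch
        import torch.nn as nn

        class ScoreSelect(nn.Module):
            def __init__(self, in_dim=3, k=64):
                super().__init__()
                self.k = k
                self.score = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                                           nn.Linear(32, 1))

            def forward(self, points):                # points: (B, N, in_dim)
                s = self.score(points).squeeze(-1)    # (B, N) learned scores
                top = torch.topk(s, self.k, dim=1)    # indices, sorted by score
                idx = top.indices.unsqueeze(-1).expand(-1, -1, points.size(-1))
                # Gathering in score order yields a sorted, permutation-invariant list.
                return torch.gather(points, 1, idx)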

    Localizing Spatial Information in Neural Spatiospectral Filters

    Beamforming for multichannel speech enhancement relies on the estimation of spatial characteristics of the acoustic scene. In its simplest form, the delay-and-sum beamformer (DSB) introduces a time delay to all channels to align the desired signal components for constructive superposition. Recent investigations of neural spatiospectral filtering revealed that these filters can be characterized by a beampattern similar to that of traditional beamformers, which shows that artificial neural networks can learn and explicitly represent spatial structure. Using the Complex-valued Spatial Autoencoder (COSPA) as an exemplary neural spatiospectral filter for multichannel speech enhancement, we investigate where and how such networks represent spatial information. We show via clustering that, for COSPA, the spatial information is represented in the features generated by a gated recurrent unit (GRU) layer that has access to all channels simultaneously, and that these features are not source-dependent but only direction-of-arrival-dependent.
    Comment: Submitted to the 31st European Signal Processing Conference (EUSIPCO 2023), Helsinki, Finland. 5 pages, 3 figures.
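
    For reference, the delay-and-sum baseline the abstract describes can be sketched in a few lines; integer-sample delays are used for brevity (practical DSBs use fractional delays), and the wrap-around of np.roll is ignored here.

        import numpy as np

        def delay_and_sum(x, delays_samples):
            """x: (C, T) multichannel signal; delays_samples: (C,) steering delays."""
            C, T = x.shape
            out = np.zeros(T)
            for c in range(C):
                # Advance channel c by its delay so the target components align.
                out += np.roll(x[c], -int(delays_samples[c]))
            return out / C  # constructive superposition of the aligned channels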

    Data-Free Backbone Fine-Tuning for Pruned Neural Networks

    Model compression techniques reduce the computational load and memory consumption of deep neural networks. After a compression operation such as parameter pruning, the model is normally fine-tuned on the original training dataset to recover from the performance drop caused by compression. However, the training data is not always available due to privacy issues or other factors. In this work, we present a data-free fine-tuning approach for pruning the backbone of deep neural networks. In particular, the pruned network backbone is trained on synthetically generated images with our proposed intermediate supervision to mimic the unpruned backbone's output feature map. Afterwards, the pruned backbone can be combined with the original network head to make predictions. We generate synthetic images by back-propagating gradients to noise images while relying on L1-pruning for the backbone pruning. In our experiments, we show that our approach is task-independent since only the backbone is pruned. By evaluating our approach on 2D human pose estimation, object detection, and image classification, we demonstrate promising performance compared to the unpruned model. Our code is available at https://github.com/holzbock/dfbf.
    Comment: Accepted for presentation at the 31st European Signal Processing Conference (EUSIPCO) 2023, September 4-8, 2023, Helsinki, Finland.
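
    A minimal sketch of the intermediate-supervision step, assuming the synthetic images have already been generated: the pruned backbone is fitted to reproduce the frozen unpruned backbone's output feature map, with the loss choice and function names as illustrative assumptions.

        import torch

        def finetune_step(pruned, unpruned, synth_images, optimizer):
            """One data-free fine-tuning step on a batch of synthetic images."""
            with torch.no_grad():
                target = unpruned(synth_images)   # teacher: unpruned feature map
            pred = pruned(synth_images)           # student: pruned feature map
            loss = torch.nn.functional.mse_loss(pred, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            return loss.item()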